# Multimodal pre-training
- **MAKE** (xieji-x) · Text-to-Image · 108 downloads · 2 likes
  A zero-shot skin disease assessment model built on vision-language pre-training with multi-faceted knowledge enhancement, offering an effective tool for skin disease research and diagnosis.
- **Style 250412.vit Base Patch16 Siglip 384.v2 Webli** (p1atdev) · Image Classification · Transformers · 66 downloads · 0 likes
  A Vision Transformer-based vision model trained with SigLIP (Sigmoid Loss for Language-Image Pretraining), suitable for image understanding tasks.
- **Comp SigLIP So400M** (SliMM-X) · Apache-2.0 · Multimodal Fusion · 33 downloads · 1 like
  Part of the CoMP-MM family, this vision foundation model (VFM) is continually pre-trained from SigLIP and supports native-resolution image input.
- **Aimv2 Large Patch14 448.apple Pt** (timm) · Image Classification · Transformers · 68 downloads · 0 likes
  AIMv2 Large image encoder packaged for the timm library, using 14x14 patches at 448x448 input resolution and suited to high-resolution image feature extraction.
- **Aimv2 3b Patch14 448.apple Pt** (timm) · Image Classification · Transformers · 79 downloads · 0 likes
  AIMv2 image encoder at the 3B-parameter scale, packaged for the timm library and suitable for image feature extraction tasks.
- **Aimv2 3b Patch14 336.apple Pt** (timm) · Image Classification · Transformers · 35 downloads · 0 likes
  AIMv2 image encoder packaged for the timm library, suitable for image feature extraction tasks.
- **Aimv2 1b Patch14 336.apple Pt** (timm) · Image Classification · Transformers · 65 downloads · 0 likes
  AIMv2 image encoder developed by Apple in a timm-compatible packaging, suitable for image feature extraction tasks (see the loading sketch below).
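
The AIMv2 checkpoints above, like the timm-hosted CLIP and SigLIP encoders further down this list, are typically used as frozen feature extractors. A minimal sketch, assuming the timm model name `aimv2_large_patch14_448.apple_pt` corresponds to the first AIMv2 entry and that a local `example.jpg` exists:

```python
# Minimal feature-extraction sketch with a timm-packaged encoder.
# The checkpoint name is an assumption based on the entry above; any of the
# timm encoders in this list can be swapped in the same way.
import timm
import torch
from PIL import Image

model = timm.create_model("aimv2_large_patch14_448.apple_pt", pretrained=True, num_classes=0)
model.eval()

# Build the preprocessing pipeline the checkpoint was trained with.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # (1, feature_dim) pooled embedding
print(features.shape)
```

Passing `num_classes=0` drops the classification head, so the model returns pooled embeddings directly.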
- **Resnet101 Clip Gap.openai** (timm) · Apache-2.0 · Image Classification · Transformers · 104 downloads · 0 likes
  The ResNet101 image encoder from OpenAI CLIP, exposing image features through global average pooling (GAP).
- **Resnet50x4 Clip Gap.openai** (timm) · Apache-2.0 · Image Classification · Transformers · 170 downloads · 0 likes
  The ResNet50x4 image encoder from OpenAI CLIP, intended for image feature extraction.
- **Resnet50 Clip Gap.openai** (timm) · Apache-2.0 · Image Classification · Transformers · 250 downloads · 1 like
  The ResNet50 visual encoder from the CLIP model, exposing image features through global average pooling (GAP).
- **Vit Huge Patch14 Clip Quickgelu 378.dfn5b** (timm) · Other · Image Classification · Transformers · 27 downloads · 0 likes
  ViT-Huge CLIP image encoder trained on the DFN-5B dataset, using the QuickGELU activation.
- **Vit Huge Patch14 Clip 378.dfn5b** (timm) · Other · Image Classification · Transformers · 461 downloads · 0 likes
  The visual encoder of DFN5B-CLIP, based on the ViT-Huge architecture and trained at 378x378 resolution.
- **Vit Base Patch16 Clip 224.dfn2b** (timm) · Other · Image Classification · Transformers · 444 downloads · 0 likes
  Vision Transformer CLIP image encoder carrying the DFN2B-CLIP weights released by Apple.
- **Vit So400m Patch14 Siglip Gap 896.pali2 10b Pt** (timm) · Apache-2.0 · Text-to-Image · Transformers · 57 downloads · 1 like
  SigLIP image encoder with global average pooling, taken from the PaliGemma 2 10B model.
- **Vit Base Patch16 Siglip 256.webli** (timm) · Apache-2.0 · Image Classification · Transformers · 269 downloads · 1 like
  SigLIP ViT-B/16 image encoder with the original attention pooling head, suitable for image feature extraction tasks.
- **Vit Huge Patch14 Clip 224.laion2b** (timm) · Apache-2.0 · Image Classification · Transformers · 1,969 downloads · 0 likes
  ViT-Huge CLIP visual encoder trained on the LAION-2B dataset, supporting image feature extraction.
- **Vit Base Patch32 Clip 256.datacompxl** (timm) · Apache-2.0 · Image Classification · Transformers · 89 downloads · 0 likes
  Vision Transformer CLIP image encoder for feature extraction, accepting 256x256 resolution input.
- **Vit Base Patch32 Clip 224.laion2b** (timm) · Apache-2.0 · Image Classification · Transformers · 83 downloads · 0 likes
  Vision Transformer CLIP image encoder for feature extraction, trained on the LAION-2B dataset.
- **Vit Base Patch32 Clip 224.datacompxl** (timm) · Apache-2.0 · Image Classification · Transformers · 13 downloads · 0 likes
  Vision Transformer CLIP image encoder for feature extraction, trained on the DataComp-XL dataset.
- **Vit Base Patch16 Clip 224.datacompxl** (timm) · Apache-2.0 · Image Classification · Transformers · 36 downloads · 0 likes
  ViT-B/16 CLIP image encoder for feature extraction, trained on the DataComp-XL dataset.
- **Convnext Xxlarge.clip Laion2b Soup** (timm) · Apache-2.0 · Image Classification · Transformers · 220 downloads · 0 likes
  ConvNeXt-XXLarge CLIP image encoder trained by LAION, suitable for multimodal tasks.
- **Convnext Base.clip Laiona** (timm) · Apache-2.0 · Image Classification · Transformers · 14 downloads · 0 likes
  ConvNeXt-Base CLIP image encoder trained on the LAION-Aesthetic dataset, suitable for image feature extraction tasks.
- **Vit Huge Patch14 Clip 224.metaclip Altogether** (timm) · Image Classification · 171 downloads · 1 like
  CLIP model based on the ViT-Huge architecture, supporting zero-shot image classification tasks.
- **Vit Base Patch16 Clip 224.laion400m E31** (timm) · MIT · Image Classification · 1,469 downloads · 0 likes
  Vision Transformer CLIP model trained on the LAION-400M dataset, supporting zero-shot image classification tasks.
- **Vit Base Patch32 Clip 224.laion400m E32** (timm) · MIT · Image Classification · 5,957 downloads · 0 likes
  Vision Transformer CLIP model trained on the LAION-400M dataset, compatible with both the OpenCLIP and timm frameworks.
- **Resnet50 Clip.cc12m** (timm) · MIT · Image Classification · 233 downloads · 0 likes
  ResNet50 CLIP model trained on the CC12M dataset, supporting zero-shot image classification tasks.
- **Resnet50 Clip.yfcc15m** (timm) · MIT · Image Classification · 631 downloads · 0 likes
  ResNet50 CLIP model trained on the YFCC-15M dataset, compatible with both the open_clip and timm frameworks and supporting zero-shot image classification (see the sketch below).
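
For the entries noted as compatible with both open_clip and timm, zero-shot classification can be run through OpenCLIP. A hedged sketch, assuming the `RN50` architecture with the `yfcc15m` pretrained tag corresponds to the Resnet50 Clip.yfcc15m entry above; the prompts and image path are illustrative:

```python
# Zero-shot classification sketch with OpenCLIP; the ("RN50", "yfcc15m")
# pairing is an assumption mapped from the entry above.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("RN50", pretrained="yfcc15m")
tokenizer = open_clip.get_tokenizer("RN50")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # placeholder image path
labels = ["a photo of a cat", "a photo of a dog"]            # illustrative prompts
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```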
- **Siglip So400m Patch14 224** (google) · Apache-2.0 · Text-to-Image · Transformers · 6,654 downloads · 53 likes
  SigLIP is a CLIP-style multimodal model that replaces the softmax contrastive loss with a sigmoid loss; pre-trained on the WebLI dataset, it suits zero-shot image classification and image-text retrieval. A usage sketch follows below.
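
Because SigLIP scores each image-text pair independently with a sigmoid, zero-shot classification differs slightly from CLIP's softmax setup. A minimal sketch via Transformers, assuming the Hub id `google/siglip-so400m-patch14-224` matches the entry above and using illustrative prompts:

```python
# Zero-shot classification sketch with a SigLIP checkpoint via Transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

repo = "google/siglip-so400m-patch14-224"   # assumed Hub id for the entry above
model = AutoModel.from_pretrained(repo)
processor = AutoProcessor.from_pretrained(repo)

image = Image.open("example.jpg")                       # placeholder image path
texts = ["a photo of a cat", "a photo of a dog"]        # illustrative prompts
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The sigmoid gives each image-text pair an independent probability rather
# than a softmax distribution over all prompts.
probs = torch.sigmoid(outputs.logits_per_image)
print(dict(zip(texts, probs[0].tolist())))
```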
- **Vit Xsmall Patch16 Clip 224.tinyclip Yfcc15m** (timm) · MIT · Image Classification · 444 downloads · 0 likes
  A compact TinyCLIP vision-language model designed for efficient zero-shot image classification.
- **Vit B 16 Aion400m E32 1finetuned 1** (Albe-njupt) · MIT · Image Classification · 18 downloads · 1 like
  Vision Transformer model based on the OpenCLIP framework, fine-tuned for zero-shot image classification tasks.
- **Internvit 6B 448px V1 2** (OpenGVLab) · MIT · Text-to-Image · Transformers · 19 downloads · 27 likes
  InternViT-6B-448px-V1-2 is a vision foundation model (feature backbone) with roughly 5.5 billion parameters, processing images at 448x448 resolution. A loading sketch follows below.
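
A hedged loading sketch for the InternViT entry, assuming the Hub id `OpenGVLab/InternViT-6B-448px-V1-2` and the usual `trust_remote_code` pattern for OpenGVLab vision backbones; the image path is a placeholder:

```python
# Feature extraction sketch for InternViT via Transformers (remote code).
# The repo id and preprocessing choices here are assumptions; in practice a
# GPU is recommended, since the checkpoint has ~5.5B parameters.
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

repo = "OpenGVLab/InternViT-6B-448px-V1-2"
model = AutoModel.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True).eval()
processor = CLIPImageProcessor.from_pretrained(repo)

image = Image.open("example.jpg").convert("RGB")        # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(torch.bfloat16)

with torch.no_grad():
    outputs = model(pixel_values)
print(outputs.last_hidden_state.shape)                  # patch-token features at 448x448
```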
- **Siglip Base Patch16 384** (google) · Apache-2.0 · Image-to-Text · Transformers · 2,570 downloads · 10 likes
  SigLIP multimodal model pre-trained on the WebLI dataset with a sigmoid loss, suitable for zero-shot image classification and image-text retrieval tasks.
- **Siglip Base Patch16 256** (google) · Apache-2.0 · Text-to-Image · Transformers · 12.71k downloads · 5 likes
  SigLIP vision-language model pre-trained on the WebLI dataset with a sigmoid loss, performing well on image classification and image-text retrieval tasks.
- **Protst Esm1b** (mila-intel) · Protein Model · Transformers · 173 downloads · 1 like
  ProtST enhances protein sequence pre-training and understanding with biomedical text: it builds the ProtDescribe dataset, defines three pre-training tasks, and supports both supervised learning and zero-shot prediction.
- **Altclip M18** (BAAI) · Text-to-Image · Transformers · 58 downloads · 5 likes
  AltCLIP-m18 is a CLIP model supporting 18 languages for image-text matching tasks.
- **CLIP Convnext Xxlarge Laion2b S34b B82k Augreg Rewind** (laion) · MIT · Text-to-Image · 63 downloads · 2 likes
  A CLIP ConvNeXt-XXLarge model trained on the LAION-2B dataset with the OpenCLIP framework, aimed at zero-shot image classification tasks.
- **Lilt Roberta En Base** (SCUT-DLVCLab) · MIT · Text Recognition · Transformers · 12.05k downloads · 19 likes
  Language-Independent Layout Transformer (LiLT) provides a LayoutLM-like model for any language by combining a pre-trained RoBERTa (English) text encoder with a pre-trained language-independent layout transformer; a usage sketch follows below.
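
A hedged sketch of feeding LiLT a few OCR words with their layout boxes via Transformers, assuming the Hub id `SCUT-DLVCLab/lilt-roberta-en-base`; the words and 0-1000 normalized bounding boxes are invented for illustration:

```python
# Running LiLT on word + bounding-box input via Transformers.
import torch
from transformers import AutoTokenizer, AutoModel

repo = "SCUT-DLVCLab/lilt-roberta-en-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

words = ["Invoice", "Total", "42.00"]                                     # illustrative OCR words
boxes = [[74, 68, 156, 92], [320, 600, 380, 624], [390, 600, 470, 624]]   # one 0-1000 box per word

encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Expand word-level boxes to token level (special tokens get a zero box).
token_boxes = [[0, 0, 0, 0] if idx is None else boxes[idx] for idx in encoding.word_ids()]
encoding["bbox"] = torch.tensor([token_boxes])

with torch.no_grad():
    outputs = model(**encoding)
print(outputs.last_hidden_state.shape)  # layout-aware contextual token embeddings
```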